Lukas Koning
library(stringdist)
fuzzy_match <- function(search, match) {
  # Convert search term to lowercase
  search <- tolower(search)
  # Lowercase and split match into words
  match <- strsplit(tolower(match), "[.,?! ]+")[[1]]
  # Initialize best match and best word
  best_match <- 0
  best_word <- ''
  # Loop over tokens
  for(word in match) {
    # Calculate string distance
    m <- stringdist(search, word)
    # Normalize for search length
    m <- 100 * (1 - (m / nchar(search)))
    # Check if match improved
    if(m > best_match) {
      best_match <- m
      best_word <- word
    }
  }
  return(list(percentage = best_match, word = best_word))
}
search <- 'testing'
strings <- c(
  'This is a test!',
  'We are testing fuzzy matching',
  'Just a random string'
)
for(str in strings) {
  # Matching
  m <- fuzzy_match(search, str)
  print(sprintf('Text: %-30s -- Word: %-20s -- Matched: %4.1f%%', str, m$word, m$percentage))
}
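Run as-is with stringdist's default OSA distance, this should show 'testing' matching itself at 100.0% in the second string, while 'test' and 'string' both come out around 57.1% (an edit distance of 3 against the 7-character search term).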
Bryan R. Balajadia
Here's a function using cosine similarity (dependency: the stringdist package).
Fn: cosine.match
# Function to match strings using the cosine similarity measure
# Arguments: str.pattern - string pattern(s); character vector
#            string      - character vector to search for str.pattern
#######################################
library(stringdist)
cosine.match <- function(str.pattern, string) {
  # Drop empty strings from the search vector
  string <- Filter(function(x) !all(c("") %in% x), string)
  # Find, for each pattern, the position of its closest match
  match.string.loc <- amatch(str.pattern, string, method = "cosine",
                             maxDist = Inf, q = 1, matchNA = FALSE)
  match.string <- string[match.string.loc]
  df <- data.frame(str.pattern, match.string.loc, match.string)
  return(df)
}
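A quick usage sketch (the inputs here are made up; amatch() is vectorised over its first argument, so a whole vector of patterns can be matched in one call):
patterns <- c("testing", "fuzy maching")
candidates <- c("This is a test!", "We are testing fuzzy matching", "")
cosine.match(patterns, candidates)
# One row per pattern: the pattern, the index of its closest
# candidate, and the matched candidate itself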
Oliver Belmans I would advise Levenshtein distance. It gives you a numeric value for how approximate a match is (the number of edits between the two strings).
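For what it's worth, base R's adist() already computes that distance, and stringdist() does the same with method = "lv" (the classic textbook pair as an example):
adist("kitten", "sitting")                                  # 3: two substitutions, one insertion
stringdist::stringdist("kitten", "sitting", method = "lv")  # also 3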
Benjamin Uminsky
I have used the stringdist package before and it is great because it gives you access to a host of different fuzzy matching algorithms (Levenshtein, Soundex, Damerau-Levenshtein... etc.). However, the drawback that I found is that you only get a single return. It will match a single pattern against a vector of patterns you wish to match with, but will only return the single closest match. If that isn't a problem, then fantastic. However, if you are looking to find any number of close matches, then you will need to use agrep(). The drawback to agrep is that it only utilizes Levenshtein to fuzzy match (although you still have the ability to modify the weights and customize your match threshold).
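A small sketch of that difference, with made-up names:
library(stringdist)
candidates <- c("jon smith", "john smith", "jane smyth", "bob jones")
# amatch() returns only the index of the single closest match
amatch("john smyth", candidates, maxDist = 3)                    # 2
# agrep() returns every candidate within the error threshold
agrep("john smyth", candidates, max.distance = 3)                # 1 2 3
agrep("john smyth", candidates, max.distance = 3, value = TRUE)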
Benjamin Uminsky The other nice thing about agrep is that, although it only takes a single pattern to match, you can still wrap it in an sapply() call to cover a whole list of patterns you wish to match with. I regularly use parSapply() + agrep() to swiftly (using parallel processing) identify fuzzy-matched duplicate voters in our massive 5.2-million-voter database. It works well, particularly when you spread the computations over 20 CPUs.
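Roughly what that pattern looks like (the data and cluster size here are stand-ins):
library(parallel)
voters <- c("jon smith", "john smith", "jane smyth", "bob jones")  # stand-in for the real file
patterns <- c("john smyth", "bob jonez")
cl <- makeCluster(4)  # e.g. 20 on a bigger machine
clusterExport(cl, "voters")
hits <- parSapply(cl, patterns, function(p)
  agrep(p, voters, max.distance = 0.2, value = TRUE))
stopCluster(cl)
hits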
Mislav Šagovac I have used a fuzzy matching package and it was very slow on big datasets. For me it was better to use the stringdist package and write my own simple function.
Mislav Šagovac And it would be good to use a set of string distances rather than only one. The best approach would be a machine learning method that learns which match is right.
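A hedged sketch of that idea: compute several of stringdist's measures as features for each candidate pair and let a model, rather than a single cut-off, decide. The feature vector below is just the starting point, not a trained model:
library(stringdist)
methods <- c("osa", "jw", "cosine", "soundex")
dist_features <- function(a, b)
  sapply(methods, function(m) stringdist(a, b, method = m))
dist_features("john smyth", "john smith")
# Label some pairs as match / non-match, then train e.g. a logistic
# regression or random forest on these feature vectors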
Abdelouahed Ben Mhamed You can use the function pmatch().
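Worth noting: pmatch() does partial (prefix) matching, not fuzzy matching, so it only helps when the search term is the start of a target string:
pmatch("med", c("mean", "median"))  # 2: unique prefix match
pmatch("me",  c("mean", "median"))  # NA: prefix is ambiguous
pmatch("mdn", c("mean", "median"))  # NA: no approximate matching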
Brendan Morse
This paper by Gaston Sanchez is very comprehensive and easy to follow.
http://gastonsanchez.com/Handling_and_Processing_Strings_in_R.pdf
Gabriel Gomes I use the stringdist package. It offers several different string distance algorithms, so I think it's the best.